Clustering Categorical Data based on Information Loss Minimization

Authors

  • Periklis Andritsos
  • Panayiotis Tsaparas
  • Renée J. Miller
  • Kenneth C. Sevcik
Abstract

As databases grow in size, understanding their structure becomes more difficult. Missing documentation and the unavailability of the original database designers make it even harder for researchers and professionals to understand the structure of large, complex databases. At the same time, data sources are distributed over several sites, and their integration introduces anomalies and often results in “dirty” databases, i.e., databases that contain erroneous or duplicate data records. Our research focuses on applying data mining, and clustering techniques in particular, to help recover and understand high-level views of data sets.

Similar resources

Presenting a clustering algorithm for categorical data by combining measures

Clustering is one of the main techniques in data mining. It is a process that partitions a data set into groups such that the data within a cluster are as close to each other as possible and the data in different clusters differ as much as possible. Clustering algorithms are divided into two categories according to the type of data: clustering algorithms for numerical data and clustering algor...

DIVCLUS-T: A monothetic divisive hierarchical clustering method

DIVCLUS-T is a divisive hierarchical clustering algorithm based on a monothetic bipartitional approach allowing the dendrogram of the hierarchy to be read as a decision tree. It is designed for either numerical or categorical data. Like the Ward agglomerative hierarchical clustering algorithm and the k-means partitioning algorithm, it is based on the minimization of the inertia criterion. Howev...

A Framework for Clustering Mixed Attribute Type Datasets

We propose a clustering framework that supports clustering of datasets with mixed attribute types (numerical, categorical) while minimizing information loss during clustering. Real-world datasets, such as medical datasets and their ontologies, have mixed attribute types. However, most conventional clustering algorithms have been designed and applied to datasets containing only single attribut...

A cluster ensemble method for clustering categorical data

Categorical data clustering (CDC) and cluster ensemble (CE) have long been considered as separate research and application areas. The main focus of this paper is to investigate the commonalities between these two problems and the uses of these commonalities for the creation of new clustering algorithms for categorical data based on cross-fertilization between the two disjoint research fields. M...

A Link-Based Cluster Collection Approach Combined Contagious Cluster With For Categorical Data Clustering

Data clustering is a challenging task in data mining. Various clustering algorithms have been developed to cluster or categorize datasets. Many algorithms are used to cluster categorical data, but some cannot be applied directly to it. Several attempts have been made to solve the problem of clustering categorical data via cluster ensembles. But th...

Publication date: 2003